Abstract


We study the problem of hierarchical clustering on planar graphs. We formulate
this in terms of an LP relaxation of ultrametric rounding. To solve this LP efficiently,
we introduce a dual cutting plane scheme that uses minimum cost perfect
matching as a subroutine in order to efficiently explore the space of planar partitions.
We apply our algorithm to the problem of hierarchical image segmentation.


1 Introduction 

In this work, we formulate hierarchical image segmentation from the perspective of estimating an
ultrametric over the set of image pixels that agrees closely with an input set of noisy pairwise
distances. An ultrametric space is a metric space in which the triangle inequality is replaced by the
ultrametric inequality $d(u,v) \le \max\{d(u,w), d(v,w)\}$. This inequality captures the transitive property of
clustering (if u and w are in the same cluster and v and w are in the same cluster, then u and v must
also be in the same cluster). Thresholding an ultrametric immediately yields a partition into sets
whose diameter is less than the given threshold, and varying the threshold naturally produces a
hierarchical clustering in which clusters at high thresholds are composed of clusters at lower thresholds.
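The thresholding property described above can be illustrated with a small sketch. The distance matrix `D` and the helper names `is_ultrametric` and `threshold_clusters` are hypothetical and chosen only for illustration; they are not part of the paper's implementation.

```python
from itertools import permutations

def is_ultrametric(d, tol=1e-12):
    """Check d[u][v] <= max(d[u][w], d[v][w]) for every triple of distinct points."""
    n = len(d)
    return all(d[u][v] <= max(d[u][w], d[v][w]) + tol
               for u, v, w in permutations(range(n), 3))

def threshold_clusters(d, t):
    """Label points so that u and v share a label when d[u][v] < t.

    For an ultrametric the relation d[u][v] < t is transitive, so one pass
    from each unlabeled representative is enough.
    """
    n = len(d)
    labels = [-1] * n
    next_label = 0
    for u in range(n):
        if labels[u] == -1:
            labels[u] = next_label
            for v in range(u + 1, n):
                if d[u][v] < t:
                    labels[v] = next_label
            next_label += 1
    return labels

# Toy ultrametric on four points (illustrative values only).
D = [[0, 1, 3, 3],
     [1, 0, 3, 3],
     [3, 3, 0, 2],
     [3, 3, 2, 0]]
hierarchy = [threshold_clusters(D, t) for t in (1.5, 2.5, 3.5)]
```

Raising the threshold only merges clusters, so the three labelings in `hierarchy` are nested.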

Inspired by the approach of [?], our method represents an ultrametric explicitly as a hierarchical
collection of segmentations. Determining the appropriate segmentation at a single distance threshold
is equivalent to finding a minimum-weight multicut in a graph with both positive and negative edge
weights [?]. Finding an ultrametric imposes the additional constraint
that these multicuts are hierarchically consistent across different thresholds. We focus on the case
where the input distances are specified by a planar graph, which arises naturally in the domain of
image segmentation where elements are pixels or superpixels and distances are defined between
neighbors. This allows us to exploit fast combinatorial algorithms for partitioning planar graphs that
yield tighter LP relaxations than the local polytope relaxation [20].

This paper is organized as follows. We first introduce the ultrametric rounding problem and the
relation between multicuts and ultrametrics. We then introduce an LP relaxation that uses a delayed
column generation approach that exploits planarity to efficiently find cuts using the classic reduction
to minimum-weight perfect matching [?]. We apply our algorithm to the task of natural
image segmentation on the Berkeley Segmentation Data Set benchmark [?]. We show compelling
visual results and demonstrate that our algorithm converges rapidly and, in practice, produces optimal
or near-optimal solutions with guarantees.


*JY acknowledges the support of Experian, CF acknowledges support of NSF grants IIS-1253538 and
DBI-1262547

2 Ultrametric Rounding and Multicuts


Let $G = (V, E)$ be a weighted graph with non-negative edge weights $\theta$ indexed by edges
$e = (u,v) \in E$. Our goal is to find an ultrametric distance $d_{(u,v)}$ over vertices of the graph that is
close to $\theta$ in the sense that the distortion $\sum_{(u,v) \in E} \|\theta_{(u,v)} - d_{(u,v)}\|_2^2$ is minimized. We begin by
reformulating this rounding problem in terms of finding a set of nested multicuts in a family of
weighted graphs.

We specify a partitioning or multicut of the vertices of the graph $G$ into components using a binary
vector $X \in \{0,1\}^{|E|}$ where $X_e = 1$ indicates that the edge $e = (u,v)$ is “cut” and that the vertices
$u$ and $v$ associated with the edge are in separate components of the partition. We use MCUT(G)
to denote the set of binary indicator vectors $X$ that represent valid multicuts of the graph $G$. For
notational simplicity, in the remainder of the paper we frequently omit the dependence on $G$, which
is given as a fixed input.

A necessary and sufficient condition for an indicator vector $X$ to define a valid multicut in $G$ is that
for every cycle of edges, if one edge on the cycle is cut then at least one other edge in the cycle must
also be cut. Let $\mathcal{C}$ denote the set of all cycles in $G$ where each cycle $c \in \mathcal{C}$ is a set of edges and
$c - e$ is the set of edges in cycle $c$ excluding edge $e$. We can express MCUT in terms of these cycle
inequalities as:

$$\mathrm{MCUT} = \Big\{ X \in \{0,1\}^{|E|} : \sum_{e' \in c - e} X_{e'} \ge X_e, \ \forall c \in \mathcal{C},\ e \in c \Big\} \tag{1}$$
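Enumerating every cycle is impractical, but for a binary vector the cycle condition can be tested with an equivalent connectivity check: a cut edge whose endpoints are still joined by a path of uncut edges closes a cycle containing exactly one cut edge. The sketch below assumes an edge-list representation; the function names are hypothetical and not taken from the paper.

```python
def find(parent, a):
    """Union-find root lookup with path halving."""
    while parent[a] != a:
        parent[a] = parent[parent[a]]
        a = parent[a]
    return a

def is_valid_multicut(num_vertices, edges, x):
    """Return True if the 0/1 edge vector x satisfies every cycle inequality.

    Uncut edges are merged into components; x is a valid multicut exactly
    when every cut edge joins two different components.
    """
    parent = list(range(num_vertices))
    for (u, v), cut in zip(edges, x):
        if not cut:
            ru, rv = find(parent, u), find(parent, v)
            if ru != rv:
                parent[ru] = rv
    return all(find(parent, u) != find(parent, v)
               for (u, v), cut in zip(edges, x) if cut)

# Triangle: cutting a single edge violates the cycle inequality.
triangle = [(0, 1), (1, 2), (0, 2)]
```

On the triangle, `[1, 0, 0]` is rejected while `[1, 1, 0]` and `[1, 1, 1]` are accepted.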


A hierarchical clustering of a graph can be described by a nested collection of multicuts. We denote
the space of valid hierarchical partitions with $L$ layers by $\bar{\Omega}_L$, which we represent by a set of $L$
edge-indicator vectors $\mathcal{X} = (X^1, X^2, X^3, \ldots, X^L)$ in which any cut edge remains cut at all finer
layers of the hierarchy.

$$\bar{\Omega}_L = \{(X^1, X^2, \ldots, X^L) : X^l \in \mathrm{MCUT},\ X^l \ge X^{l+1}\ \forall l\} \tag{2}$$


Given a valid hierarchical clustering $\mathcal{X}$, an ultrametric $d$ can be specified over the vertices of the
graph by choosing a sequence of real values $0 = \delta^0 < \delta^1 < \delta^2 < \ldots < \delta^L$ that indicate a distance
threshold associated with each level $l$ of the hierarchical clustering. The ultrametric distance $d$
specified by the pair $(\mathcal{X}, \delta)$ assigns a distance to each pair of vertices $d_{(u,v)}$ based on the coarsest
level of the clustering at which they remain in separate clusters. For pairs corresponding to an edge
in the graph $(u,v) = e \in E$ we can write this explicitly in terms of the multicut indicator vectors
as:

$$d_e = \max_{l \in \{0,1,\ldots,L\}} \delta^l X^l_e = \sum_{l=0}^{L} \delta^l [X^l_e > X^{l+1}_e] \tag{3}$$

We assume by convention that $X^0 = 1$ and $X^{L+1} = 0$. Pairs $(u,v)$ that do not correspond to an
edge in the original graph can still be assigned a unique distance based on the coarsest level $l$ at
which they lie in different connected components of the cut specified by $X^l$.
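As a sanity check on the two forms of the edge distance in Eq. 3, the following sketch evaluates both for a single edge. The function name and the list layout (index 0 of `x_levels` holds layer 1) are assumptions made for illustration.

```python
def edge_distance(x_levels, deltas):
    """Evaluate both forms of the edge distance for one edge.

    x_levels: [X^1_e, ..., X^L_e], assumed nested (non-increasing in l).
    deltas:   [delta^0, ..., delta^L] with delta^0 = 0.
    Returns (max form, indicator-sum form).
    """
    L = len(x_levels)
    x = [1] + list(x_levels) + [0]          # conventions X^0 = 1, X^{L+1} = 0
    as_max = max(deltas[l] * x[l] for l in range(L + 1))
    as_sum = sum(deltas[l] * (x[l] > x[l + 1]) for l in range(L + 1))
    return as_max, as_sum

deltas = [0, 1, 2, 3]
cut_through_layer_2 = edge_distance([1, 1, 0], deltas)
```

The edge is cut in layers 1 and 2 but not in layer 3, so both forms give the threshold of layer 2.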

To compute the quality of an ultrametric $d$ with respect to an input set of edge weights $\theta$, we measure
the squared $L_2$ difference between the edge weights and the ultrametric distance $\|\theta - d\|_2^2$. To write
this compactly in terms of multicut indicator vectors, we construct a set of weights for each edge
and layer, denoted $\theta^l_e$, so that $\sum_{l=0}^{L} \theta^l_e X^l_e = \|\theta_e - d_e\|^2$. These weights are given explicitly by the
telescoping series:

$$\theta^0_e = \|\theta_e\|^2, \qquad \theta^l_e = \|\theta_e - \delta^l\|^2 - \|\theta_e - \delta^{l-1}\|^2 \quad \forall l \ge 1 \tag{4}$$

We use $\theta^l \in \mathbb{R}^{|E|}$ to denote the vector containing $\theta^l_e$ for all $e \in E$.
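The telescoping construction can be checked numerically for one edge. This is a minimal sketch under the conventions stated in the text ($\delta^0 = 0$, $X^0 = 1$, nested layers); the function names are hypothetical.

```python
def layer_weights(theta_e, deltas):
    """Per-layer weights theta^l_e for l = 0..L built by telescoping."""
    w = [theta_e ** 2]
    for l in range(1, len(deltas)):
        w.append((theta_e - deltas[l]) ** 2 - (theta_e - deltas[l - 1]) ** 2)
    return w

def squared_error_via_weights(theta_e, deltas, x_levels):
    """Sum of theta^l_e * X^l_e over l = 0..L, with X^0 = 1."""
    x = [1] + list(x_levels)
    return sum(wl * xl for wl, xl in zip(layer_weights(theta_e, deltas), x))

def squared_error_direct(theta_e, deltas, x_levels):
    """(theta_e - d_e)^2 with d_e the largest threshold at which the edge is cut."""
    x = [1] + list(x_levels)
    d_e = max(dl * xl for dl, xl in zip(deltas, x))
    return (theta_e - d_e) ** 2

via_weights = squared_error_via_weights(2.5, [0, 1, 2, 3], [1, 1, 0])
direct = squared_error_direct(2.5, [0, 1, 2, 3], [1, 1, 0])
```

Both routes give the same squared error because all intermediate terms of the series cancel.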
For a fixed number of levels $L$ and fixed set of thresholds $\delta$, the problem of finding the nearest
ultrametric $d$ can then be written as an integer linear program (ILP) over the edge cut indicators.

$$\begin{aligned}
\min_{\mathcal{X} \in \bar{\Omega}_L} \sum_{e \in E} \|\theta_e - d_e\|^2
&= \min_{\mathcal{X} \in \bar{\Omega}_L} \sum_{e \in E} \Big\|\theta_e - \sum_{l=0}^{L} \delta^l [X^l_e > X^{l+1}_e]\Big\|^2 \\
&= \min_{\mathcal{X} \in \bar{\Omega}_L} \sum_{l=0}^{L} \sum_{e \in E} \theta^l_e X^l_e
 = \min_{\mathcal{X} \in \bar{\Omega}_L} \sum_{l=0}^{L} \theta^l \cdot X^l
\end{aligned} \tag{5}$$

This optimization corresponds to solving a collection of minimum-weight multicut problems where
the multicuts are constrained to be hierarchically consistent.


Computing minimum-weight multicuts (also known as correlation clustering) is NP hard even in the
case of planar graphs. A direct approach to finding an approximate solution to Eq. 5 is to relax
the integrality constraints on $\bar{\Omega}_L$ and instead optimize over the whole polytope defined by the set of
cycle inequalities. We write CYC to indicate the polytope of real-valued indicator vectors $X$ that
satisfy the cycle inequalities

$$\mathrm{CYC} = \Big\{ X \in [0,1]^{|E|} : \sum_{e' \in c - e} X_{e'} \ge X_e, \ \forall c \in \mathcal{C},\ e \in c \Big\} \tag{6}$$

and use $\Omega_L$ to denote the corresponding relaxation of $\bar{\Omega}_L$ given by

$$\Omega_L = \{(X^1, X^2, \ldots, X^L) : X^l \in \mathrm{CYC},\ X^l \ge X^{l+1}\ \forall l\}$$

While the polytope CYC contains non-integral vertices (it is not the convex hull of MCUT), the
integral vertices of CYC do correspond exactly to the set of valid multicuts [12].

In practice, we found that applying a straightforward cutting-plane approach that successively adds
violated cycle inequalities to this relaxation of Eq. 5 requires far too many constraints and is too
slow to be useful. Instead, we develop a column generation approach tailored for planar graphs that
allows for efficient and accurate approximate inference.


3 The Cut Cone and Planar Multicuts 

Consider a partition of a planar graph into two disjoint sets of nodes. We denote the space of
indicator vectors corresponding to such two-way cuts by CUT. A cut may yield more than two
connected components but it cannot produce every possible multicut (e.g., it cannot split a triangle
of three nodes into three separate components). Let $Z \in \{0,1\}^{|E| \times |\mathrm{CUT}|}$ be an indicator matrix
where each column specifies a valid two-way cut with $Z_{ek} = 1$ if and only if edge $e$ is cut in
two-way cut $k$. The indicator vector of any multicut in a planar graph can be generated by a suitable
linear combination of cuts (columns of $Z$) that isolate the individual components from the rest of
the graph, where the weight of each such cut is $\tfrac{1}{2}$.
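The half-weight superposition can be verified on a small example: every cut edge of a partition separates exactly two components, so it appears in exactly two isolating cuts. The sketch below assumes each label corresponds to one connected component; the graph and function names are illustrative only.

```python
def isolating_cuts(edges, labels):
    """One two-way cut per component: edges leaving that component are cut."""
    return [[1 if (labels[u] == c) != (labels[v] == c) else 0 for u, v in edges]
            for c in sorted(set(labels))]

def multicut_indicator(edges, labels):
    """Edge is cut when its endpoints carry different component labels."""
    return [1 if labels[u] != labels[v] else 0 for u, v in edges]

# Square with one diagonal (planar); vertices 0 and 1 share a component.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
labels = [0, 0, 1, 2]

cuts = isolating_cuts(edges, labels)
half_sum = [0.5 * sum(col) for col in zip(*cuts)]
```

Here `half_sum` reproduces the multicut indicator edge by edge.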

Let $\gamma \in \mathbb{R}^{|\mathrm{CUT}|}$ be a vector specifying a positive weighted combination of cuts. The set
$\mathrm{CUT}^{\triangle} = \{Z\gamma : \gamma \ge 0\}$ is the conic hull of CUT or “cut cone”. Since any multicut can be expressed as a
superposition of cuts, the cut cone is identical to the conic hull of MCUT. This equivalence suggests
an LP relaxation of the minimum-cost multicut given by

$$\min_{\gamma \ge 0} \ \theta \cdot Z\gamma \quad \text{s.t.} \quad Z\gamma \le 1 \tag{7}$$




(a) Linear combination of cut vectors

(b) Hierarchical cuts

Figure 1: (a) Any partitioning $X$ can be represented as a linear superposition of cuts $Z$ where
each cut isolates a connected component of the partition and is assigned a weight $\gamma = \tfrac{1}{2}$ [20]. By
introducing an auxiliary slack variable $\beta$, we are able to represent a larger set of valid indicator
vectors $X$ using fewer columns of $Z$. (b) By introducing additional slack variables at each layer of
the hierarchical segmentation, we can efficiently represent many hierarchical segmentations (the
layers $X^l$ shown here) that are consistent from layer to layer while using only a small number of
cut indicators as columns of $Z$.


where the vector $\theta \in \mathbb{R}^{|E|}$ specifies the edge weights. For the case of planar graphs, any solution to
this LP relaxation satisfies the cycle inequalities (see Appendix A and [?]).

Expanded Multicut Objective: Since the matrix $Z$ contains an exponential number of cuts, Eq. 7
is still intractable. Instead we consider an approximation using a constraint set $\hat{Z}$ which is a subset
of columns of $Z$. In previous work [20], we showed that since the optimal multicut may no longer
lie in the span of the reduced cut matrix $\hat{Z}$, it is useful to allow some values of $\hat{Z}\gamma$ to exceed 1 (see
Figure 1(a) for an example).

We introduce a slack vector $\beta \ge 0$ that tracks the presence of any “overcut” edges and prevents
them from contributing to the objective when the corresponding edge weight is negative. Let
$\theta^-_e = \min(\theta_e, 0)$ denote the non-positive component of $\theta_e$. The expanded multicut objective is given by:

$$\min_{\gamma \ge 0,\ \beta \ge 0} \ \theta \cdot \hat{Z}\gamma - \theta^- \cdot \beta \quad \text{s.t.} \quad \hat{Z}\gamma - \beta \le 1 \tag{8}$$

For any edge $e$ such that $\theta_e < 0$, any decrease in the objective from overcutting by an amount $\beta_e$
is exactly compensated for in the objective by the term $-\theta^-_e \beta_e$.

When $\hat{Z}$ contains all cuts (i.e., $\hat{Z} = Z$) then Eq. 7 and Eq. 8 are equivalent [20]. Further, if $\gamma^*$ is the
minimizer of Eq. 8 when $\hat{Z}$ only contains a subset of columns, then the edge indicator vector given
by $X = \min(1, \hat{Z}\gamma^*)$ still satisfies the cycle inequalities (see Appendix A for details).


4 Relaxing Ultrametric Rounding 


To relax the ultrametric rounding problem, we replace the multicut problem at each layer $l$ with
the expanded multicut objective described by Eq. 8. We let $\gamma = \{\gamma^1, \gamma^2, \gamma^3, \ldots, \gamma^L\}$ and
$\beta = \{\beta^1, \beta^2, \beta^3, \ldots, \beta^L\}$ denote the collection of weights and slacks for the levels of the
hierarchy and let $\theta^{+l}_e = \max(0, \theta^l_e)$ and $\theta^{-l}_e = \min(0, \theta^l_e)$ denote the positive and negative components
of $\theta^l$. We write the relaxed ultrametric rounding problem as:

$$\min_{\gamma \ge 0,\ \beta \ge 0} \ \sum_{l=1}^{L} \left(\theta^l \cdot Z\gamma^l - \theta^{-l} \cdot \beta^l\right) \tag{9}$$

$$\text{s.t.} \quad Z\gamma^{l+1} \le Z\gamma^l \quad \forall l < L$$

$$Z\gamma^l - \beta^l \le 1 \quad \forall l \tag{10}$$

where we have dropped the $l = 0$ term from Eq. 5, which is a constant.

Expanded Ultrametric Cut Cone Objective: As with Eq. 8, it is computationally useful to
introduce an additional slack vector associated with each level $l$ and edge $e$, which we denote as
$\alpha = \{\alpha^1, \alpha^2, \ldots, \alpha^{L-1}\}$. The introduction of $\alpha^l_e$ allows for cuts represented by $Z\gamma^l$ to
violate the hierarchical constraint $Z\gamma^l \ge Z\gamma^{l+1}$. However, we modify the objective so that violations
of the original hierarchy constraint are paid for in proportion to $\theta^{+l}_e$. The introduction of $\alpha$ allows
us to find valid ultrametrics while using a smaller number of columns of $Z$ than would
otherwise be required (illustrated in Figure 1(b)). We call this relaxed ultrametric rounding problem
including the slack variable $\alpha$ the expanded ultrametric rounding objective, written as:

$$\min_{\gamma \ge 0,\ \beta \ge 0,\ \alpha \ge 0} \ \sum_{l=1}^{L} \theta^l \cdot Z\gamma^l + \sum_{l=1}^{L} -\theta^{-l} \cdot \beta^l + \sum_{l=1}^{L-1} \theta^{+l} \cdot \alpha^l \tag{11}$$

$$\text{s.t.} \quad Z\gamma^{l+1} + \alpha^{l+1} \le Z\gamma^l + \alpha^l \quad \forall l < L \tag{12}$$

$$Z\gamma^l - \beta^l \le 1 \quad \forall l$$

where by convention we define $\alpha^L = 0$.

Given a solution $(\alpha, \beta, \gamma)$ we can recover a relaxed solution to the ultrametric rounding problem
(Eq. 9) over $\Omega_L$ by setting $X^l_e = \min(1, \max_{m \ge l} (Z\gamma^m)_e)$. In Appendix B we demonstrate that
for any $(\alpha, \beta, \gamma)$ that obeys the constraints in Eq. 12, this thresholding operation yields a solution $\mathcal{X}$
that lies in $\Omega_L$ and achieves the same or lower objective value.
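The recovery step is a running maximum taken from the coarsest layer down to the finest, followed by clipping at 1. The sketch below assumes the per-layer products $(Z\gamma^l)_e$ have already been computed and are passed in as plain lists; the function name and the numbers are illustrative.

```python
def decode_layers(zg):
    """Recover nested layer values from per-layer cut superpositions.

    zg[l][e] holds (Z gamma^{l+1})_e, i.e. list index 0 is layer 1.
    Returns X with X[l][e] = min(1, max over m >= l of zg[m][e]).
    """
    num_layers, num_edges = len(zg), len(zg[0])
    X = [None] * num_layers
    running = [0.0] * num_edges
    for l in range(num_layers - 1, -1, -1):      # sweep from layer L down to layer 1
        running = [max(r, v) for r, v in zip(running, zg[l])]
        X[l] = [min(1.0, r) for r in running]
    return X

zg = [[0.5, 1.2],
      [0.8, 0.3],
      [0.2, 0.0]]
X = decode_layers(zg)
```

The first edge violates the hierarchy in the raw values (0.5 at layer 1 but 0.8 at layer 2); after decoding, each layer dominates the next.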

5 The Dual Objective 

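We optimize the dual of the objective in Eq. 11 using an efficient column generation approach
based on perfect matching. A detailed derivation is given in Appendix C. Briefly, we introduce two
sets of Lagrange multipliers $\omega = \{\omega^1, \omega^2, \ldots, \omega^{L-1}\}$ and $\lambda = \{\lambda^1, \lambda^2, \lambda^3, \ldots, \lambda^L\}$ corresponding
to the between and within layer constraints respectively. For notational convenience, let $\omega^0 = 0$.
The dual objective can then be written as

$$\max_{\omega \ge 0,\ \lambda \ge 0} \ \sum_{l=1}^{L} -\lambda^l \cdot 1 \tag{13}$$

$$\text{s.t.} \quad \theta^{-l} \le -\lambda^l \quad \forall l$$

$$-(\omega^{l-1} - \omega^l) \le \theta^{+l} \quad \forall l$$

$$(\theta^l + \lambda^l + \omega^{l-1} - \omega^l) \cdot Z \ge 0 \quad \forall l$$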

The dual LP can be interpreted as finding a small modification of the original edge weights $\theta^l$ so
that every possible two-way cut of each resulting graph at level $l$ has non-negative weight. Observe
that the introduction of the two slack terms $\alpha$ and $\beta$ in the primal problem (Eq. 11) results in bounds
on the Lagrange multipliers $\lambda$ and $\omega$ in the dual problem in Eq. 13. In practice these dual constraints
turn out to be essential for efficient optimization and constitute the core contribution of this paper.
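To see where the bounds on the multipliers come from, the Lagrangian of the expanded primal can be sketched as follows. This is a brief outline rather than the full derivation given in the appendix; it assumes multipliers $\omega^l \ge 0$ on the between-layer constraints and $\lambda^l \ge 0$ on the within-layer constraints, and it adopts the boundary convention $\omega^0 = \omega^L = 0$ (the $\omega^L$ convention is an assumption made here for compact notation).

```latex
\begin{align*}
\mathcal{L}(\gamma,\beta,\alpha,\omega,\lambda)
 &= \sum_{l=1}^{L} \theta^{l}\cdot Z\gamma^{l}
  - \sum_{l=1}^{L} \theta^{-l}\cdot\beta^{l}
  + \sum_{l=1}^{L-1} \theta^{+l}\cdot\alpha^{l} \\
 &\quad + \sum_{l=1}^{L-1} \omega^{l}\cdot\bigl(Z\gamma^{l+1}+\alpha^{l+1}-Z\gamma^{l}-\alpha^{l}\bigr)
  + \sum_{l=1}^{L} \lambda^{l}\cdot\bigl(Z\gamma^{l}-\beta^{l}-1\bigr) \\[4pt]
 &= -\sum_{l=1}^{L} \lambda^{l}\cdot 1
  + \sum_{l=1}^{L} \bigl(\theta^{l}+\lambda^{l}+\omega^{l-1}-\omega^{l}\bigr)\cdot Z\gamma^{l} \\
 &\quad + \sum_{l=1}^{L} \bigl(-\theta^{-l}-\lambda^{l}\bigr)\cdot\beta^{l}
  + \sum_{l=1}^{L-1} \bigl(\theta^{+l}+\omega^{l-1}-\omega^{l}\bigr)\cdot\alpha^{l}
\end{align*}
```

Minimizing over $\gamma, \beta, \alpha \ge 0$ is bounded only if each coefficient vector is non-negative. The $\gamma^l$ coefficient gives the cut constraints, the $\beta^l$ coefficient gives $\theta^{-l} \le -\lambda^l$, and the $\alpha^l$ coefficient gives $-(\omega^{l-1} - \omega^l) \le \theta^{+l}$, leaving $\sum_l -\lambda^l \cdot 1$ as the dual objective.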

6 Solving the Dual via Cutting Planes 

The chief complexity of the dual LP is contained in the constraints including $Z$, which encode
non-negativity of an exponential number of cuts of the graph represented by the columns of $Z$. To
circumvent the difficulty of explicitly enumerating the columns of $Z$, we employ a cutting plane
method that efficiently searches for additional violated constraints (columns of $Z$) which are then
successively added.

Let $\hat{Z}$ denote the current working set of columns. Our dual optimization algorithm iterates over
the following three steps: (1) solve the dual LP with $\hat{Z}$, (2) find the most violated constraint of the
form $(\theta^l + \lambda^l + \omega^{l-1} - \omega^l) \cdot Z \ge 0$ for each layer $l$, (3) append a column to the matrix $\hat{Z}$ for each
such cut found. We terminate when no violated constraints exist or a computational budget has been
exceeded.

6.1 Finding Violated Constraints 

Identifying columns to add to $\hat{Z}$ is carried out for each layer $l$ separately. Finding the most violated
constraint of the full problem corresponds to computing the minimum-weight cut of a graph with
edge weights $\theta^l + \lambda^l + \omega^{l-1} - \omega^l$. If this cut has non-negative weight then all the constraints are
satisfied, otherwise we add the corresponding cut indicator vector as an additional column of $\hat{Z}$.

To generate a new constraint for layer $l$ based on the current Lagrange multipliers, we solve

$$z^l = \arg\min_{z \in \mathrm{CUT}} \sum_{e \in E} (\theta^l_e + \lambda^l_e + \omega^{l-1}_e - \omega^l_e)\, z_e \tag{14}$$

and subsequently add the new constraints from all layers to our LP, $\hat{Z} \leftarrow [\hat{Z}, z^1, z^2, \ldots, z^L]$.

Unlike the multicut problem, finding a (two-way) cut in a planar graph can be solved exactly by a
reduction to minimum-weight perfect matching. This is a classic result that, e.g., provides an exact
solution for the ground state of a 2D lattice Ising model without a ferromagnetic field [?]
in $O(N^{3/2} \log N)$ time [?].

Computing a lower bound: At a given iteration, prior to adding a newly generated set of constraints,
we can compute the total residual constraint violation over all layers of the hierarchy by
$\Delta = \sum_l (\theta^l + \lambda^l + \omega^{l-1} - \omega^l) \cdot z^l$. In Appendix D we demonstrate that the value of the dual objective plus $\tfrac{3}{2}\Delta$
is a lower bound on the relaxed ultrametric rounding problem in Eq. 11. Thus, as the costs of the
minimum-weight matchings approach zero from below, the objective of the reduced problem over
$\hat{Z}$ approaches an accurate lower bound on optimization over $\bar{\Omega}_L$.

6.2 Implementation Details 

Expanding generated cut constraints: When a given cut $z^l$ produces more than two connected
components, we found it useful to add a constraint corresponding to each component, following the
approach of [20]. Let the number of connected components of $z^l$ be denoted $M$. For each of the $M$
components we add one column to $\hat{Z}$, corresponding to the cut that isolates that connected
component from the rest. This allows more flexibility in representing the final optimum multicut as
superpositions of these components. In addition, we also found it useful in practice to maintain a
separate set of constraints $\hat{Z}^l$ for each layer $l$. Maintaining independent constraints $\hat{Z}^1, \hat{Z}^2, \ldots, \hat{Z}^L$
can result in a smaller overall LP.

Speeding convergence of $\omega$: We found that adding an explicit penalty term to the objective that
encourages small values of $\omega$ speeds up convergence dramatically with no loss in solution quality.
This penalty is scaled by a parameter e = 10“^ which is chosen to be extremely small in magnitude 
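relative to the values of $\theta$ so that it only has an influence when no other “forces” are acting on a
given term in $\omega$. With this refinement, the LP solved at each iteration of the cutting plane algorithm
is given as follows.

$$\max_{\omega \ge 0,\ \lambda \ge 0} \ \sum_{l=1}^{L} -\lambda^l \cdot 1 - \epsilon \|\omega\|_1 \tag{15}$$

$$\text{s.t.} \quad \theta^{-l} \le -\lambda^l \quad \forall l$$

$$-(\omega^{l-1} - \omega^l) \le \theta^{+l} \quad \forall l$$

$$(\theta^l + \lambda^l + \omega^{l-1} - \omega^l) \cdot \hat{Z}^l \ge 0 \quad \forall l$$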


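Algorithm 1 Dual Ultrametric Rounding via Cutting Planes

  $\hat{Z}^l \leftarrow \{\}\ \forall l$, residual $\leftarrow -\infty$
  while residual < 0 do
    $\{\omega\}, \{\lambda\} \leftarrow$ Solve Eq. 15 given $\hat{Z}$
    residual = 0
    for $l = 1 : L$ do
      $z^l \leftarrow \arg\min_{z \in \mathrm{CUT}} (\theta^l + \lambda^l + \omega^{l-1} - \omega^l) \cdot z$
      residual $\leftarrow$ residual $+ \tfrac{3}{2}(\theta^l + \lambda^l + \omega^{l-1} - \omega^l) \cdot z^l$
      $\{z(1), z(2), \ldots, z(M)\} \leftarrow$ isocuts($z^l$)
      $\hat{Z}^l \leftarrow \hat{Z}^l \cup \{z(1), z(2), \ldots, z(M)\}$
    end for
  end while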


6.3 Primal Decoding 

Algorithm 1 gives a summary of the dual solver, which at termination produces a lower bound as
well as a set of cuts described by the constraint matrices $\hat{Z}^l$. The subroutine isocuts($z^l$) computes
the set of cuts that isolate each connected component of $z^l$.

To generate a hierarchical clustering, we solve the primal, Eq. 11, using this reduced set $\hat{Z}$ in order to
recover a fractional solution $X^l_e = \min(1, \max_{m \ge l} (\hat{Z}^m \gamma^m)_e)$. We use an LP solver (IBM CPLEX)
which provides this primal solution “for free” when solving the dual in Algorithm 1.

We round this fractional solution to a discrete hierarchical clustering using a simple thresholding
strategy. We threshold the fractional $X$ as follows: $X^l \leftarrow [X^l > t]$. We then repair any cut edges
that lie inside a connected component by setting them to zero to ensure that $X^l \in \mathrm{MCUT}$. In our
implementation we test a few discrete thresholds $t \in \{0, 0.2, 0.4, 0.6, 0.8\}$ and take the threshold
that yields $X$ with the lowest cost. After each pass through the loop of Alg. 1 we compute these
upper bounds and retain the best solution observed thus far.
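The threshold-and-repair step can be sketched for a single layer as follows. This is a simplified illustration, not the authors' MATLAB code: it handles one layer in isolation, uses a union-find pass for the repair, and all names and numbers are hypothetical.

```python
def find(parent, a):
    """Union-find root lookup with path halving."""
    while parent[a] != a:
        parent[a] = parent[parent[a]]
        a = parent[a]
    return a

def round_and_repair(num_vertices, edges, x_frac, t):
    """Threshold fractional edge values at t, then repair to a valid multicut.

    Edges with x_frac <= t are merged; any edge whose endpoints end up in the
    same component is reported as uncut, which removes dangling cut edges.
    """
    parent = list(range(num_vertices))
    for (u, v), xe in zip(edges, x_frac):
        if not xe > t:
            parent[find(parent, u)] = find(parent, v)
    return [1 if find(parent, u) != find(parent, v) else 0 for u, v in edges]

def best_rounding(num_vertices, edges, x_frac, theta,
                  thresholds=(0.0, 0.2, 0.4, 0.6, 0.8)):
    """Try each threshold and keep the repaired multicut with the lowest cost."""
    best = None
    for t in thresholds:
        x = round_and_repair(num_vertices, edges, x_frac, t)
        cost = sum(w * xe for w, xe in zip(theta, x))
        if best is None or cost < best[0]:
            best = (cost, t, x)
    return best

triangle = [(0, 1), (1, 2), (0, 2)]
cost, t, x = best_rounding(3, triangle, [0.9, 0.1, 0.3], [-1.0, 2.0, 0.5])
```

At `t = 0.4` only edge (0, 1) survives thresholding, but its endpoints remain connected through vertex 2, so the repair removes it; the best candidate here is `t = 0.2`.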


7 Experiments 

We applied our algorithm to segmentation problems based on images from the Berkeley Segmentation
Data Set (BSDS) [?]. To construct our input graph we use superpixels generated by
performing an oriented watershed transform on the output of the global probability of boundary (gPb)
edge detector [?]. The vertices of the graph are superpixels and edges connect superpixels that are
neighbors in the image, yielding a planar graph.

We construct base distance costs $\theta$ by using the log-odds ratio of the local estimate of boundary
contrast given by averaging gPb classifier output over the boundary between neighboring
superpixels to yield a value $gPb_e$. We truncated extreme values to enforce that $gPb_e \in [\epsilon, 1 - \epsilon]$ with
$\epsilon = 0.001$. We set $\theta_e = \log\frac{gPb_e}{1 - gPb_e} + \log\frac{1 - \epsilon}{\epsilon}$. The additive offset assures that $\theta_e \ge 0$. In our
experiments we use a fixed set of eleven distance threshold levels $\{\delta^l\}$ that uniformly span the
useful range of threshold values [9.6, 12.6]. We weighted edges proportionally to the length of the
corresponding boundary in the image. We performed dual cutting plane iterations until convergence
or 2000 seconds had passed. Lower-bounds for the BSDS segmentations were on the order of —10^ 
or —10^. We terminate when the total residual is greater than —2 x 10“^. All codes were written in 
MATLAB using the Blossom V implementation of minimum-weight perfect matching [?] and the
IBM ILOG CPLEX LP solver with default options.
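The construction of the base distance costs can be sketched in a few lines. This assumes the offset is the log-odds of the upper truncation bound, which is the value that makes the smallest possible cost exactly zero; the function name is hypothetical and length weighting is omitted.

```python
import math

def base_cost(gpb_e, eps=0.001):
    """Log-odds of the averaged boundary strength, shifted to be non-negative."""
    g = min(max(gpb_e, eps), 1.0 - eps)          # truncate to [eps, 1 - eps]
    return math.log(g / (1.0 - g)) + math.log((1.0 - eps) / eps)

costs = [base_cost(g) for g in (0.0, 0.5, 1.0)]
```

With `eps = 0.001` the costs range from 0 to roughly 13.8, which is consistent with threshold levels chosen between 9.6 and 12.6.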

7.1 Qualitative and Quantitative Results on Images 

Figs. 2 and 3 show qualitative results for two images from the BSDS test data set. We display
segmentations at eleven thresholds and color connected components of the segmentation at each layer with the
average pixel color over that component. In Fig. 4 we show the comparison of our ultrametric
rounding algorithm (Alg. 1, denoted UM) with the baseline ultrametric contour maps algorithm (UCM)
with and without length weighting [?]. UCM performs agglomerative clustering, successively
merging segments with small boundary strengths to produce a hierarchical segmentation. We
display a precision recall plot on the Berkeley Segmentation Data Set test set.

Figure 2: Top left to bottom right: A hierarchical image segmentation for a BSDS test set image
showing eleven layers listed from fine to coarse. The original image is in the top left.

In terms of segmentation accuracy, UM rounding performs nearly identically to the state-of-the-art
UCM algorithm with regard to precision recall, which is the standard measure employed in
the literature. However, we show some improvements in the high-precision range of the curve, which
corresponds to the coarse segmentations. It is worth noting that the BSDS benchmark does not
provide strong penalties for small leaks between two segments when the total number of boundary
pixels involved is small. Our algorithm may find strong application in domains where the local
boundary signal is noisier (e.g., biological imaging) or when under-segmentation is more heavily
penalized.

7.2 Objective Cost and Timing Experiments 

In Figs. 5 and 6 we display plots demonstrating the performance of the optimization routine according to
eight different measures. The most interesting is the quality of the integer solution. We found that the
upper bound given by the cost of the decoded integer solution and the lower bound estimated by
the dual LP are very close. The magnitude of the integrality gap is typically less than 0.1% of the
magnitude of the lower bound and never more than 1%. Convergence of the dual is achieved quite
rapidly; most instances require fewer than 100 iterations to converge, with roughly linear growth in the
size of the LP at each iteration as cutting planes are added.

7.3 Cost Comparison with Ultrametric Contour Maps 

We also compared the ultrametric rounding cost of solutions generated by our approach with costs
associated with hierarchical clusterings produced by the Ultrametric Contour Map (UCM)
length-weighted clusterings. This test is perhaps unfair as UCM was not necessarily designed to minimize
the ultrametric rounding cost, but it provides a baseline for understanding the rounding objective.

UCM provides an ultrametric solution denoted $U$, where $U_e$ is indexed by edge $e$ and scaled to lie
in the range [0,1] with smaller values indicating lower likelihood of a boundary. For each level $l$, we
select a threshold $q^l \in [0,1]$ which is used to threshold the UCM ultrametric $U$. We choose a value
for $q^l$ which minimizes the ultrametric rounding error, formally written as:

$$\min_{q^l} \sum_{e \in E} \theta^l_e [U_e > q^l] \tag{16}$$
Figure 3: Top left to bottom right: A hierarchical image segmentation for a BSDS test set image
showing eleven layers listed from fine to coarse. The original image is in the top left.
[Figure 4 plot: Precision Recall Curve for UCM vs UM]

Figure 4: We show the comparison of our ultrametric rounding algorithm (UM) with the baseline
ultrametric contour maps algorithm (UCM) with and without length weighting [?]. We display
precision recall plots on the Berkeley Segmentation Data Set (BSDS). Observe that UM performs
nearly identically to the state-of-the-art UCM algorithm with regard to precision recall. However,
we do observe small but significant improvements in the high-precision range of the curve. We mark the
points plotted on the precision recall curve for UM with black dots. Use of length-weighted costs
is indicated by +L.


Thus the total cost for a given image is:

$$\sum_{l=1}^{L} \min_{q^l} \sum_{e \in E} \theta^l_e [U_e > q^l] \tag{17}$$

Observe that $\theta^l \le \theta^{l+1}$ and thus we are guaranteed $q^l \le q^{l+1}$.
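The per-layer threshold search used to score a UCM ultrametric can be sketched as an exhaustive scan over candidate thresholds. The candidate set, function names and toy numbers below are assumptions for illustration; the paper does not state how the minimization over the threshold was carried out.

```python
def best_threshold_cost(theta_l, U, candidates):
    """Return (q, cost) minimizing the sum of theta_l[e] over edges with U[e] > q."""
    best = None
    for q in candidates:
        cost = sum(w for w, u in zip(theta_l, U) if u > q)
        if best is None or cost < best[1]:
            best = (q, cost)
    return best

def total_cost(theta_layers, U, candidates):
    """Sum over layers of the best per-layer thresholded cost."""
    return sum(best_threshold_cost(t, U, candidates)[1] for t in theta_layers)

U = [0.1, 0.5, 0.9]                      # toy UCM boundary strengths per edge
candidates = [0.0, 0.1, 0.5, 0.9]        # thresholds only matter at values of U
theta_layers = [[2.0, -1.0, -3.0],       # layer 1 weights
                [3.0, 1.0, -0.5]]        # layer 2 weights (elementwise larger)
per_layer = [best_threshold_cost(t, U, candidates) for t in theta_layers]
```

In this toy case the selected thresholds are non-decreasing across layers, matching the monotonicity observation in the text.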

In Fig. 7 we display a histogram, computed over test image problem instances, of the cost of UCM
solutions relative to those produced by UM rounding. A value of 1 indicates equality. A value
greater than 1 indicates UCM providing lower cost while a value less than 1 indicates UM providing
lower cost. In no instance did UCM outperform our UM algorithm, though our UM algorithm often
outperformed UCM.

7.4 Segmentation performance and running time 

While our cutting-plane approach is slower than agglomerative clustering, it is not necessary to wait
for convergence in order to produce high quality results. We found that while the upper and lower
bounds decrease as a function of time, the clustering performance as measured by precision-recall
is often nearly optimal after only ten seconds and remains stable after that. We show PR
curves at several time points in Fig. 8. In Fig. 9 we show a plot of the maximum F-measure of UM
rounding as a function of time relative to the final values of UCM with and without length weighting.

7.5 Importance of enforcing hierarchical constraints 

Although independently finding multicuts at different thresholds often produces hierarchical
clusterings, this is by no means guaranteed. We ran Algorithm 1 while enforcing that $\omega^l_e = 0\ \forall [e \in E, l]$.
This allows the multicut problem for each layer to be solved independently as if the others did not
exist. To solve these multicut problem instances we used the solver of [20]. Our data set of 200
images with 11 layers per problem yields 2200 total multicut instances. The less constrained
single-layer solver produced a lower or equal cost multicut compared to the hierarchical solver in
99.77 percent of problem instances. In Fig. 10 we show examples of hierarchy constraints being
violated severely on multiple images when solving with $\omega$ forced to zero. Introduction of the hierarchy
constraint fixes such errors.
[Figure 5 panels: (a) Iteration vs Portion Running; (b) Time vs Portion Running; (c) Time vs Residual; (d) Iteration vs Dual Time]



Figure 5: (a) We display the portion of the problems that have not terminated as a function of
cutting-plane iteration. We observe that dual optimization requires the solution of at least
a few LPs for most problems to converge. (b) We display the portion of the problems that have
not terminated as a function of time. We observe that dual optimization terminates rapidly for most
problem instances. (c) We plot the value of the average residual constraint violation as a function of
time averaged over images that have yet to terminate. Instances that terminated before 2000 seconds
passed have residuals on the order of 10“^ or less. We plot the best observed value in solid blue 
and the current value in dotted blue. We normalize the residual for a given instance by dividing
by the magnitude of the tightest lower bound for that instance. We indicate the portion of instances
that have yet to terminate using black bars. The bars are associated with the percent of instances
incomplete, with the bars from left to right being [95, 85, 75, 65, ..., 5]. Observe that the value of the
residual decays rapidly. (d) We plot the average amount of time per cutting-plane iteration. This
includes solving one LP, finding the most violated constraint and extracting cuts for each layer.
We use black bars as in (c) to indicate the percent of problem instances that have not terminated
after a given time point.
[Figure 6 panels: (c) Time vs UB (blue), LB (red), UB-C (green); (d) Time vs Number Columns]


Figure 6: (a) We show a plot over image problem instances that describes the gap between the
maximum lower bound computed for that image and the final rounded integral solution. To normalize the
energy gap, we scale by the value of the maximum lower bound identified for that problem instance.
We observe that the rounded integer solutions are near exact or exact on all images. (b) Scatter plot
of the run time (in seconds) versus the minimum magnitude residual (residual is always non-positive).
We normalize this by dividing by the maximum lower bound over the course of optimization (denoted
LB) of the problem instance. Residual was negligible for all except 1 of 200 problem instances,
which did not terminate within 2000 seconds. (c) We show the value of the integer solution and
lower bound as a function of time averaged over problem instances. We normalize by computing the
absolute value of the gap between each bound and the magnitude of the maximum lower bound
discovered. We plot the value of the upper/lower bounds in blue/red. We plot in green the value of the
integer solution but include time for rounding the solution after each iteration. We use dotted/solid
lines to indicate the current/best value observed thus far. We indicate the percentage of instances
that have yet to terminate using black bars marking [95, 85, 75, 65, ..., 5] percent. (d) We show the
number of constraints (columns of $\hat{Z}$ summed over layers and averaged over problem instances)
as a function of running time. We use black bars as in (c) to indicate the proportion of the problem
instances that have not converged at a given time point.
Figure 7: We compare the quality of the ultrametric rounding produced by our ultrametric
rounding (UM) with the baseline ultrametric contour maps algorithm (UCM) in terms of the ultrametric
rounding objective. We plot a histogram of the ratio of objective values of UCM and UM. All ratios
were less than 1, showing that in no instances did UM produce a worse solution than UCM.


[Figure 8 panels, each titled Precision Recall Curve for UCM vs UM: (a) after 5 seconds; (b) after 10 seconds; (c) after 15 seconds; (d) after 30 seconds]


Figure 8: Anytime performance: We show the precision-recall curve for segmentations derived
from the lowest-cost solution decoded at a particular amount of execution time (green curves),
stopping at T = 5, 10, 15 and 30 seconds respectively. We conclude that high-tolerance numerical
convergence is not necessary to achieve good quality segmentations. For comparison, we plot the UCM
with and without length weighting in red and blue respectively and the UM results after all problems
terminate in black.
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Time vs Maximum F Measure 



Figure 9: Anytime performance: We plot the maximum F-measure on the BSDS benchmark as 
a function of run-time. Clock time includes lower-bound optimization and upper-bound decoding 
after each iteration. We also include the maximum F-measure produced by UCM with and without 
length weighting. The final F-measures achieved by UCM, UCM+L and UM are 0.728, 0.726 and 0.718 
respectively. 


8 Conclusion 

We have introduced a new method for ultrametric rounding on planar graphs that is applicable to 
hierarchical image segmentation. Our contribution is a dual cutting plane approach that exploits 
the introduction of novel slack terms that allow for representing a much larger space of solutions 
with relatively few cutting planes. This yields an efficient algorithm that provides rigorous bounds 
on the quality of the resulting solution. We empirically observe that our algorithm rapidly produces 
compelling image segmentations along with lower and upper bounds that are nearly tight on the 
benchmark BSDS test data set. 
Figure 10: Examples where hierarchically nested segmentations give more semantically meaningful 
groupings of the image. The proposed ultrametric rounding (UM) enforces consistency across levels, 
while performing independent correlation clustering (CC) at each threshold does not guarantee a 
hierarchical segmentation (cf. first image). In the second image, hierarchical segmentation (UM) 
preserves semantic parts of the two birds while merging the background regions. In the third image, 
CC merges the background clutter into the foreground leaf region at a very low threshold due to a single 
weak edge. 
A Expanded multicut objective and the cycle inequalities 

In this appendix we show that for planar graphs, solving the expanded multicut optimization 
produces solutions that satisfy the cycle inequalities and have equivalent cost when truncated to lie in 
the unit hypercube. This establishes an equivalence between the expanded multicut optimization 

$$\min_{\gamma \geq 0,\ \beta \geq 0} \ \theta \cdot Z\gamma - \theta^{-} \cdot \beta \quad \text{s.t.} \quad Z\gamma - \beta \leq 1 \tag{18}$$

and the cycle polytope relaxation 

$$\min_{X \in \mathrm{CYC}} \ \theta \cdot X \tag{19}$$

for the case of planar graphs. 
A.1 Multicut cone and Cycle cone 


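Recall that CUT and MCUT denote the sets of binary indicator vectors that represent valid two-way 
cuts and multicuts respectively for a specified graph G. We denote the conic hulls of these sets by 

$$\mathrm{CUT}^{\triangle} = \Big\{ \sum_i \eta_i X^i : X^i \in \mathrm{CUT},\ \eta_i \geq 0 \Big\}, \qquad \mathrm{MCUT}^{\triangle} = \Big\{ \sum_i \eta_i X^i : X^i \in \mathrm{MCUT},\ \eta_i \geq 0 \Big\} \tag{22}$$

Finally, we denote the cone of positive vectors satisfying the cycle inequalities by: 

$$\mathrm{CYC}^{\triangle} = \Big\{ X \geq 0 : \sum_{\hat{e} \in c - e} X_{\hat{e}} \geq X_e, \ \forall c \in C,\ e \in c \Big\} \tag{23}$$

We now state two basic results concerning these cones. 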

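Proposition 1: $\mathrm{MCUT}^{\triangle} = \mathrm{CUT}^{\triangle}$ 

Every cut indicator is a multicut indicator, hence $\mathrm{CUT}^{\triangle} \subseteq \mathrm{MCUT}^{\triangle}$. On the other hand, any 
multicut $X \in \mathrm{MCUT}$ can be written as a conic combination of the cuts that isolate each connected 
component, each with weight $\frac{1}{2}$ (every cut edge lies on the boundary of exactly two components), so that 
$X = \frac{1}{2} \sum_i X^i$ with $X^i \in \mathrm{CUT}$. Thus $\mathrm{MCUT} \subseteq \mathrm{CUT}^{\triangle}$ and hence 
$\mathrm{MCUT}^{\triangle} \subseteq \mathrm{CUT}^{\triangle}$. 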
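
The decomposition used in this argument can be checked numerically on a toy graph. The following is a minimal sketch, assuming the complete graph on four nodes (which is planar) and component labels given as a tuple; the helper names are hypothetical and not part of the method itself.

```python
from fractions import Fraction
from itertools import combinations

NODES = range(4)
EDGES = list(combinations(NODES, 2))  # the six edges of K4, a small planar graph


def multicut_indicator(labels):
    """1 for every edge whose endpoints carry different component labels."""
    return [int(labels[u] != labels[v]) for u, v in EDGES]


def isolating_cut(labels, block):
    """Two-way cut separating the component `block` from all other nodes."""
    return [int((labels[u] == block) != (labels[v] == block)) for u, v in EDGES]


def half_sum_of_isolating_cuts(labels):
    """Conic combination: weight 1/2 on the cut isolating each component."""
    cuts = [isolating_cut(labels, b) for b in sorted(set(labels))]
    return [Fraction(sum(c[i] for c in cuts), 2) for i in range(len(EDGES))]


labels = (0, 0, 1, 2)  # components {0, 1}, {2}, {3}
print(multicut_indicator(labels) == half_sum_of_isolating_cuts(labels))
```

Each cut edge is counted once by each of the two components it separates, which is why the weight 1/2 recovers the multicut indicator exactly.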

Proposition 2: If G is planar, $\mathrm{CUT}^{\triangle} = \mathrm{CYC}^{\triangle}$ 

A stronger version of this result due to Co) states that for a graph G containing no $K_5$ minor, the 
set of cycle inequalities over chordless circuits is sufficient to specify the facets of the cut polytope 
for G. See (13 (p. 434) for a detailed discussion. 


A.2 The projected solution $\min(1, Z\gamma)$ satisfies the cycle inequalities 

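As a result of the basic properties of the cut cone, for any $\gamma \geq 0$, we have $Z\gamma \in \mathrm{CYC}^{\triangle}$ for planar 
graphs. Let $X = \min(1, Z\gamma)$ be a solution to the expanded multicut objective and $(Z\gamma)_e$ denote the 
value for a particular edge e. It must then be that $X \in \mathrm{CYC}^{\triangle}$ since: 

$$\sum_{\hat{e} \in c - e} \min(1, (Z\gamma)_{\hat{e}}) \geq \min\Big(1, \sum_{\hat{e} \in c - e} (Z\gamma)_{\hat{e}}\Big) \tag{24}$$

$$\geq \min(1, (Z\gamma)_e) \qquad \forall c \in C,\ e \in c \tag{25}$$

The first inequality arises from pulling the min outside the sum. The second inequality holds since 
$Z\gamma \in \mathrm{CYC}^{\triangle}$. 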
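
As an illustration of this argument, the sketch below draws random conic combinations of two-way cuts on the complete graph on four nodes, truncates them at 1, and checks every cycle inequality. The graph, the random weights and all function names are assumptions chosen for the example; exact rational arithmetic is used to avoid rounding effects.

```python
from fractions import Fraction
from itertools import combinations
import random

NODES = range(4)
EDGES = list(combinations(NODES, 2))  # K4 is planar


def cut_vector(side):
    """Indicator over EDGES of the two-way cut separating `side` from the rest."""
    return [int((u in side) != (v in side)) for u, v in EDGES]


# All distinct non-trivial two-way cuts (node 0 is fixed on one side).
CUTS = [cut_vector({0, *rest}) for k in range(3) for rest in combinations(range(1, 4), k)]


def cycle_edges(order):
    """Edges of the cycle visiting the nodes in `order` and returning to the start."""
    return [tuple(sorted((order[i], order[(i + 1) % len(order)]))) for i in range(len(order))]


# The four triangles and the three 4-cycles of K4.
CYCLES = [cycle_edges(t) for t in combinations(NODES, 3)]
CYCLES += [cycle_edges(o) for o in [(0, 1, 2, 3), (0, 1, 3, 2), (0, 2, 1, 3)]]


def satisfies_cycle_inequalities(x):
    value = dict(zip(EDGES, x))
    return all(sum(value[f] for f in cyc if f != e) >= value[e] for cyc in CYCLES for e in cyc)


def random_z_gamma(rng):
    """Z gamma for a random gamma >= 0, where the columns of Z are the cuts above."""
    gamma = [Fraction(rng.randint(0, 8), 4) for _ in CUTS]
    return [sum(g * c[i] for g, c in zip(gamma, CUTS)) for i in range(len(EDGES))]


def project(z_gamma):
    return [min(Fraction(1), v) for v in z_gamma]


rng = random.Random(0)
ok = all(satisfies_cycle_inequalities(project(random_z_gamma(rng))) for _ in range(1000))
print("all projected points satisfy the cycle inequalities:", ok)
```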


A.3 The projected solution $\min(1, Z\gamma)$ achieves an objective cost no greater than that of $Z\gamma$ 

We now demonstrate that the fractional multicut $X = \min(1, Z\gamma)$ given by projecting the solution 
$Z\gamma$ yields a solution with an equal or smaller objective value. 

Recall that $\beta$ is a positive slack variable that allows corresponding edge indicators to take on a value 
greater than 1. 

$$Z\gamma - \beta \leq 1 \tag{26}$$

Since the objective is non-decreasing in $\beta$, for a given setting of $\gamma$ an optimal setting of the slack 
variables is given by: 

$$\beta^{*} = \max(0, Z\gamma - 1) \tag{27}$$
We split the objective into positive and negative edges and write: 

$$\theta \cdot Z\gamma - \theta^{-} \cdot \beta = \theta^{+} \cdot Z\gamma + \theta^{-} \cdot Z\gamma - \theta^{-} \cdot \beta \tag{28}$$

$$= \theta^{+} \cdot Z\gamma + \theta^{-} \cdot \min(1, Z\gamma) \tag{29}$$

$$\geq \theta^{+} \cdot \min(1, Z\gamma) + \theta^{-} \cdot \min(1, Z\gamma) \tag{30}$$

$$= \theta \cdot \min(1, Z\gamma) \tag{31}$$

$$= \theta \cdot X \tag{32}$$

which establishes that projecting $Z\gamma$ onto the unit cube yields a fractional multicut solution that 
does not increase the objective. 
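
The chain of (in)equalities above can be traced on concrete numbers. This is a minimal sketch in which `theta` and `z` (standing for $Z\gamma$) are arbitrary illustrative vectors and the function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def expanded_cost(theta, z):
    """theta . z - theta^- . beta*, with beta* = max(0, z - 1) as in Eq. (27)."""
    theta_neg = [min(t, 0) for t in theta]  # negative part of the edge weights
    beta_star = [max(0, v - 1) for v in z]
    return dot(theta, z) - dot(theta_neg, beta_star)


def projected_cost(theta, z):
    """theta . min(1, z), the cost of the truncated solution X."""
    return dot(theta, [min(1, v) for v in z])


theta = [2, -3]  # one attractive (positive) and one repulsive (negative) edge
z = [3, 2]       # both entries exceed 1, so truncation changes both
print(expanded_cost(theta, z), projected_cost(theta, z))  # prints "3 -1"
```

Here the slack absorbs the excess on the negative edge (giving the term $\theta^{-} \cdot \min(1, Z\gamma)$), while truncating the positive edge can only lower the cost.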


B Expanded ultrametric objective and fractional ultrametrics 


Recall the set of fractional ultrametrics is defined as follows: 

$$\bar{\Omega}_L = \{ (X^1, \ldots, X^L) : X^l \in \mathrm{CYC},\ X^l \geq X^{l+1} \ \forall l \} \tag{33}$$

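In analogy with the previous appendix, we show the equivalence of the expanded ultrametric 
rounding problem: 

$$\min_{\gamma \geq 0,\ \beta \geq 0,\ \alpha \geq 0} \ \sum_{l=1}^{L} \theta^l \cdot Z\gamma^l + \sum_{l=1}^{L} -\theta^{-l} \cdot \beta^l + \sum_{l=1}^{L-1} \theta^{+l} \cdot \alpha^l \tag{34}$$

$$\text{s.t.} \quad Z\gamma^{l+1} + \alpha^{l+1} \leq Z\gamma^l + \alpha^l \quad \forall l < L, \qquad Z\gamma^l - \beta^l \leq 1 \quad \forall l \tag{35}$$

with the relaxed problem: 

$$\min_{X \in \bar{\Omega}_L} \ \sum_{l=1}^{L} \theta^l \cdot X^l \tag{36}$$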


Given an optimal solution to the expanded ultrametric rounding problem specified by $(\gamma, \alpha, \beta)$, we 
produce a fractional ultrametric H by the projection operation: 

$$H^l = \min\big(1, \max_{m \geq l} (Z\gamma^m)\big) = \max\big(H^{l+1}, \min(1, Z\gamma^l)\big) \tag{37}$$

We show that the resulting projection H yields a valid fractional ultrametric $H \in \bar{\Omega}_L$ whose cost is 
no greater than the cost of the corresponding solution to the expanded objective. 
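
The two forms of the projection can be compared directly. The sketch below assumes the layers are given as plain lists of non-negative edge values (layer 1 first); `project_backward` and `project_direct` are hypothetical names for the recursive and closed-form expressions respectively.

```python
def project_backward(layers):
    """H^l = max(H^{l+1}, min(1, z^l)), computed from the top layer down."""
    L = len(layers)
    H = [None] * L
    upper = [0] * len(layers[0])  # stands in for an all-zero layer above layer L
    for l in range(L - 1, -1, -1):
        upper = [max(h, min(1, z)) for h, z in zip(upper, layers[l])]
        H[l] = upper
    return H


def project_direct(layers):
    """H^l = min(1, max_{m >= l} z^m), evaluated edge by edge."""
    L = len(layers)
    return [[min(1, max(layers[m][e] for m in range(l, L)))
             for e in range(len(layers[l]))]
            for l in range(L)]


# Three layers over two edges; entries play the role of (Z gamma^l)_e.
layers = [[0.2, 2.0], [0.5, 0.3], [0.1, 0.4]]
H = project_backward(layers)
print(H == project_direct(layers))
# Nesting H^l >= H^{l+1} holds by construction.
print(all(a >= b for lo, hi in zip(H, H[1:]) for a, b in zip(lo, hi)))
```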


B.1 Projecting expanded solutions into $\bar{\Omega}_L$ 

By construction, H satisfies the hierarchical constraint $H^l \geq H^{l+1}$. We show that $H^l \in \mathrm{CYC}$ by 
induction. In the previous appendix, we established that $H^L = \min(1, Z\gamma^L) \in \mathrm{CYC}$. Observe that 
each $H^l$ for $l < L$ is the coordinate-wise max of $H^{l+1}$ and $\min(1, Z\gamma^l)$, both of which are in CYC, 
so we only need to show that CYC is closed under coordinate-wise maximum. 

Let $X^1$ and $X^2$ be two elements of CYC and $X^3 = \max(X^1, X^2)$. We have, $\forall c \in C,\ e \in c$: 

$$\sum_{\hat{e} \in c - e} X^3_{\hat{e}} = \sum_{\hat{e} \in c - e} \max(X^1_{\hat{e}}, X^2_{\hat{e}}) \tag{38}$$

$$\geq \max\Big( \sum_{\hat{e} \in c - e} X^1_{\hat{e}}, \sum_{\hat{e} \in c - e} X^2_{\hat{e}} \Big) \tag{39}$$

$$\geq \max(X^1_e, X^2_e) = X^3_e \tag{40}$$

where the first inequality arises from pulling the max outside the sum and the second because $X^1$ 
and $X^2$ each satisfy the cycle inequality. Hence $X^3 \in \mathrm{CYC}$. 
B.2 The cost of H is no greater than that of $\{\gamma, \alpha, \beta\}$ 


Fixing an optimal solution to the expanded ultrametric problem specified by $\gamma$, we first note that the 
optimal values of $\beta$ and $\alpha$ are given by: 

$$\beta^l = \max(0, Z\gamma^l - 1) \tag{42}$$

$$\alpha^l = \max_{m \geq l} (Z\gamma^m - Z\gamma^l) \tag{43}$$

The formula for $\alpha$ can be developed by starting from layer L and working down, setting $\alpha$ to the 
smallest possible value needed to satisfy the inter-layer constraints for a given $\gamma$. 

$$\alpha^L = 0$$

$$\alpha^{L-1} = \max(0, Z\gamma^L - Z\gamma^{L-1})$$

$$\alpha^{L-2} = \max(0, Z\gamma^L - Z\gamma^{L-2}, Z\gamma^{L-1} - Z\gamma^{L-2}) \tag{44}$$

Since the objective is non-decreasing in $\alpha$ and $\beta$, these values are the smallest values for which the 
constraints are satisfied. 
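
The layer-by-layer construction of $\alpha$ and the closed form in Eq. (43) can be compared on a small instance. The sketch below assumes the layers are lists of exact rational edge values (layer 1 first); the function names are hypothetical.

```python
from fractions import Fraction as F


def alpha_by_recursion(layers):
    """Start from alpha^L = 0 and work down, taking the smallest feasible value."""
    L, n = len(layers), len(layers[0])
    alpha = [[F(0)] * n for _ in range(L)]
    for l in range(L - 2, -1, -1):
        # Inter-layer constraint: z^{l+1} + alpha^{l+1} <= z^l + alpha^l, alpha^l >= 0.
        alpha[l] = [max(F(0), layers[l + 1][e] + alpha[l + 1][e] - layers[l][e])
                    for e in range(n)]
    return alpha


def alpha_closed_form(layers):
    """alpha^l = max_{m >= l} (z^m - z^l); the m = l term contributes 0."""
    L, n = len(layers), len(layers[0])
    return [[max(layers[m][e] - layers[l][e] for m in range(l, L)) for e in range(n)]
            for l in range(L)]


layers = [[F(1, 2), F(2)], [F(1), F(1, 2)], [F(1, 4), F(3)]]
print(alpha_by_recursion(layers) == alpha_closed_form(layers))
```

For this instance both routes give $\alpha^1 = (1/2, 1)$, $\alpha^2 = (0, 5/2)$ and $\alpha^3 = (0, 0)$.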

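Plugging in the settings of the slack variables for each layer l we have: 

$$\theta^l \cdot Z\gamma^l - \theta^{-l} \cdot \beta^l + \theta^{+l} \cdot \alpha^l$$

$$= (\theta^{+l} + \theta^{-l}) \cdot Z\gamma^l - \theta^{-l} \cdot \max(0, Z\gamma^l - 1) + \theta^{+l} \cdot \max_{m \geq l} (Z\gamma^m - Z\gamma^l)$$

$$= \theta^{+l} \cdot \big(Z\gamma^l + \max_{m \geq l} (Z\gamma^m - Z\gamma^l)\big) + \theta^{-l} \cdot \big(Z\gamma^l - \max(0, Z\gamma^l - 1)\big)$$

$$= \theta^{+l} \cdot \max_{m \geq l} Z\gamma^m + \theta^{-l} \cdot \min(1, Z\gamma^l)$$

$$\geq \theta^{+l} \cdot \min(1, \max_{m \geq l} Z\gamma^m) + \theta^{-l} \cdot \min(1, Z\gamma^l)$$

$$\geq \theta^{+l} \cdot \min(1, \max_{m \geq l} Z\gamma^m) + \theta^{-l} \cdot \min(1, \max_{m \geq l} Z\gamma^m)$$

$$= \theta^l \cdot H^l$$

where the second inequality holds because the max introduced is multiplied by a negative weight. 
Since projection can only remain the same or decrease the cost of each layer, the total objective must 
also be no greater than the expanded solution: 

$$\sum_{l} \theta^l \cdot Z\gamma^l - \theta^{-l} \cdot \beta^l + \theta^{+l} \cdot \alpha^l \ \geq \ \sum_{l} \theta^l \cdot H^l$$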
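
The layer-wise bound can be checked numerically. The following sketch, with hypothetical function names and randomly drawn data, evaluates the expanded objective with the slack settings of Eqs. (42)-(43) and compares it with the cost of the projection H.

```python
import random
from fractions import Fraction


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def expanded_cost(thetas, layers):
    """Expanded objective with beta and alpha set as in Eqs. (42)-(43)."""
    L = len(layers)
    total = 0
    for l in range(L):
        theta_pos = [max(t, 0) for t in thetas[l]]
        theta_neg = [min(t, 0) for t in thetas[l]]
        beta = [max(0, z - 1) for z in layers[l]]
        # For the top layer the max runs over m = l only, so alpha^L = 0.
        alpha = [max(layers[m][e] - z for m in range(l, L))
                 for e, z in enumerate(layers[l])]
        total += dot(thetas[l], layers[l]) - dot(theta_neg, beta) + dot(theta_pos, alpha)
    return total


def projected_cost(thetas, layers):
    """Sum over layers of theta^l . H^l with H^l = min(1, max_{m >= l} z^m)."""
    L = len(layers)
    total = 0
    for l in range(L):
        H_l = [min(1, max(layers[m][e] for m in range(l, L)))
               for e in range(len(layers[l]))]
        total += dot(thetas[l], H_l)
    return total


rng = random.Random(0)
thetas = [[rng.randint(-4, 4) for _ in range(5)] for _ in range(3)]
layers = [[Fraction(rng.randint(0, 10), 4) for _ in range(5)] for _ in range(3)]
print(expanded_cost(thetas, layers) >= projected_cost(thetas, layers))
```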


C Derivation of Dual Problem 


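Here we give a derivation of the dual objective over the expanded ultrametric cut cone which we 
utilize to provide an efficient column generation approach based on perfect matching. 

We introduce two sets of Lagrange multipliers $\{\omega^1 \ldots \omega^{L-1}\}$ and $\{\lambda^1 \ldots \lambda^L\}$ corresponding to the 
inter-layer constraints and the unit-bound constraints, respectively, of the expanded problem in Eq. 34: 

$$\min_{\gamma \geq 0,\ \beta \geq 0,\ \alpha \geq 0} \ \max_{\omega \geq 0,\ \lambda \geq 0} \ \sum_{l=1}^{L} \theta^l \cdot Z\gamma^l + \sum_{l=1}^{L} -\theta^{-l} \cdot \beta^l + \sum_{l=1}^{L-1} \theta^{+l} \cdot \alpha^l$$

$$+ \sum_{l=1}^{L-1} \omega^l \cdot (Z\gamma^{l+1} + \alpha^{l+1} - Z\gamma^l - \alpha^l) + \sum_{l=1}^{L} \lambda^l \cdot (Z\gamma^l - 1 - \beta^l) \tag{45}$$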
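For notational convenience, we set $\omega^0 = 0$ and $\omega^L = 0$. We reorder the terms of the Lagrangian in 
terms of summations over the primal variable indices. 

$$\min_{\gamma \geq 0,\ \beta \geq 0,\ \alpha \geq 0} \ \max_{\omega \geq 0,\ \lambda \geq 0} \ \sum_{l=1}^{L} -\lambda^l \cdot 1 + \sum_{l=1}^{L} (-\theta^{-l} - \lambda^l) \cdot \beta^l$$

$$+ \sum_{l=1}^{L-1} (\theta^{+l} + \omega^{l-1} - \omega^l) \cdot \alpha^l + \sum_{l=1}^{L} (\theta^l + \lambda^l + \omega^{l-1} - \omega^l) \cdot Z\gamma^l \tag{46}$$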


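Each primal variable yields a positivity constraint in the dual. 

$$\max_{\omega \geq 0,\ \lambda \geq 0} \ \sum_{l=1}^{L} -\lambda^l \cdot 1$$

$$\text{s.t.} \quad (-\theta^{-l} - \lambda^l) \geq 0 \quad \forall l$$

$$(\theta^{+l} + \omega^{l-1} - \omega^l) \geq 0 \quad \forall l$$

$$(\theta^l + \lambda^l + \omega^{l-1} - \omega^l) \cdot Z \geq 0 \quad \forall l \tag{47}$$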


This dual LP can be interpreted as finding a modification of the original edge weights $\theta^l$ so that every 
possible cut of each resulting graph has non-negative weight. Observe that the introduction of the 
two slack terms $\alpha$ and $\beta$ in the primal problem (Eq. 34) results in bounds on the Lagrange multipliers 
$\lambda$ and $\omega$ in the dual problem in Eq. 47. The constraint $-\theta^{-l} - \lambda^l \geq 0$ is a result of the introduction 
of $\beta^l$. The constraint $\omega^l - \omega^{l-1} \leq \theta^{+l}$ is a result of the introduction of $\alpha^l$. In practice these bounds 
turn out to be essential for efficient optimization and are a key contribution of this paper. 

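It is also informative to make the substitution $\mu^l = \omega^l - \omega^{l-1}$, which yields a slightly more symmetric 
formulation: 

$$\max_{\lambda,\ \mu} \ \sum_{l=1}^{L} -\lambda^l \cdot 1 \tag{48}$$

$$\text{s.t.} \quad 0 \leq \lambda^l \leq -\theta^{-l} \quad \forall l$$

$$0 \leq \sum_{m=1}^{l} \mu^m \quad \forall l \tag{49}$$

$$\mu^l \leq \theta^{+l} \quad \forall l$$

$$(\theta^l + \lambda^l - \mu^l) \cdot Z \geq 0 \quad \forall l$$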

D Producing a genuine lower bound on the optimal integer solution 


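Consider optimizing the Lagrangian over the set of integer solutions $X \in \Omega_L$. In this case the $\alpha$, $\beta$ 
terms disappear. For a given setting of the remaining multipliers $\omega$, $\lambda$ we have a lower bound on the 
optimal integer solution given by: 

$$L(\omega, \lambda) = \min_{X \in \Omega_L} \sum_{l=1}^{L} \theta^l \cdot X^l + \omega^l \cdot (X^{l+1} - X^l) + \lambda^l \cdot (X^l - 1)$$

$$= \min_{X \in \Omega_L} \sum_{l=1}^{L} \theta^l \cdot X^l + \omega^{l-1} \cdot X^l - \omega^l \cdot X^l + \lambda^l \cdot X^l - \lambda^l \cdot 1$$

$$= \min_{X \in \Omega_L} \sum_{l=1}^{L} (\theta^l + \omega^{l-1} - \omega^l + \lambda^l) \cdot X^l - \lambda^l \cdot 1$$

$$= \sum_{l=1}^{L} -\lambda^l \cdot 1 + \min_{X \in \Omega_L} \sum_{l=1}^{L} (\theta^l + \omega^{l-1} - \omega^l + \lambda^l) \cdot X^l$$

$$\geq \sum_{l=1}^{L} -\lambda^l \cdot 1 + \sum_{l=1}^{L} \min_{X^l \in \mathrm{MCUT}} (\theta^l + \omega^{l-1} - \omega^l + \lambda^l) \cdot X^l$$

$$\geq \sum_{l=1}^{L} -\lambda^l \cdot 1 + \sum_{l=1}^{L} \frac{3}{2} \min_{X^l \in \mathrm{CUT}} (\theta^l + \omega^{l-1} - \omega^l + \lambda^l) \cdot X^l \tag{50}$$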


where the first inequality arises from dropping the constraints between layers of the hierarchy and 
the second inequality holds for planar graphs, where the value of the optimal multicut is bounded below by 
$\frac{3}{2}$ times the value of the optimal two-way cut (see [20]). 
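
The relation between the optimal multicut and the optimal two-way cut can be checked by brute force on a toy planar graph. The sketch below assumes the complete graph on four nodes with random integer edge weights and enumerates every partition explicitly; the function names are hypothetical, and the enumeration is only a stand-in for the matching-based cut solver used in the actual algorithm. The factor 3/2 reflects that the components of a planar multicut can be four-coloured, and pairing the four colours in the three possible ways yields three two-way cuts whose sum counts every cut edge exactly twice.

```python
from itertools import combinations
import random

NODES = range(4)
EDGES = list(combinations(NODES, 2))  # K4, a small planar graph


def partitions(n):
    """Yield every set partition of range(n) as a tuple of block labels."""
    def extend(labels, n_blocks):
        if len(labels) == n:
            yield tuple(labels)
            return
        for b in range(n_blocks + 1):  # join an existing block or open a new one
            yield from extend(labels + [b], max(n_blocks, b + 1))
    yield from extend([0], 1)


def cost(theta, labels):
    """Total weight of the edges cut by the partition `labels`."""
    return sum(t for t, (u, v) in zip(theta, EDGES) if labels[u] != labels[v])


def min_multicut(theta):
    return min(cost(theta, p) for p in partitions(len(NODES)))


def min_two_way_cut(theta):
    # Partitions with at most two blocks; the empty cut (cost 0) is included.
    return min(cost(theta, p) for p in partitions(len(NODES)) if max(p) <= 1)


rng = random.Random(0)
theta = [rng.randint(-5, 5) for _ in EDGES]
print(2 * min_multicut(theta) >= 3 * min_two_way_cut(theta))
```

The comparison is written as `2 * multicut >= 3 * cut` so that it stays in integer arithmetic.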